Predicting C. difficile infection severity from the taxonomic composition of the gut microbiome
Kelly L. Sovacool1, Sarah E. Tomkovich2, Megan L. Coden4, Vincent B. Young2,4, Krishna Rao4, Patrick D. Schloss2,5
1 Department of Computational Medicine & Bioinformatics, University of Michigan
2 Department of Microbiology & Immunology, University of Michigan
3 Department of Molecular, Cellular, and Developmental Biology, University of Michigan
4 Division of Infectious Diseases, Department of Internal Medicine, University of Michigan
5 Center for Computational Medicine and Bioinformatics, University of Michigan
Introduction
- C. difficile infection (CDI) can lead to adverse outcomes including recurrent infections, colectomy, and death1.
- The composition of the gut microbiome plays an important role in determining colonization resistance and clearance when exposed to C. difficile2,3.
- Regression models trained on Electronic Health Records extracted on the day of diagnosis perform modestly well at predicting whether the CDI resulted in ICU admission, colectomy, or 30-day mortality (AUROC 0.69)4.
- Identifying the specific microbiome features that distinguish severe CDI cases would allow clinicians to tailor interventions based on a patient’s risk, ultimately leading to better health outcomes.
Dataset
We have 16S amplicon sequence data from 1,191 CDI patient stool samples, with
cases classified as severe or not severe according to three separate definitions:
- IDSA: the Infectious Diseases Society of America (IDSA) definition with severe CDI having a white blood cell count ≥ 15 k/μL and serum creatinine level ≥ 1.5 mg/dL5.
- Attributable: the CDC definition of ICU admission, colectomy, or death occurring within 30 days of CDI, and confirmed as attributable to CDI via clinical chart review.
- All-cause: ICU admission, colectomy, or death occurring within 30 days of CDI, regardless of the cause.
| Severe | IDSA | Attributable | All-cause |
|---|---|---|---|
| no | 649 | 513 | 1059 |
| yes | 342 | 26 | 83 |
The attributable severity definition requires chart review by physicians, which has been completed for about half of the cases.
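As a quick check on the counts above, the sketch below tallies how many samples carry a label under each definition and what fraction of the labeled cases are severe. It assumes the three count columns follow the order in which the definitions are listed (IDSA, Attributable, All-cause); the function name `summarize` is illustrative and not part of the original analysis.

```python
# Counts copied from the table above as (not severe, severe).
# Assumption: columns follow the order of the definitions in the text.
counts = {
    "IDSA": (649, 342),
    "Attributable": (513, 26),
    "All-cause": (1059, 83),
}
TOTAL_SAMPLES = 1191  # CDI patient stool samples in the dataset


def summarize(counts, total):
    """Return {definition: (n_labeled, fraction_labeled, severe_prevalence)}."""
    summary = {}
    for name, (not_severe, severe) in counts.items():
        n_labeled = not_severe + severe
        summary[name] = (n_labeled, n_labeled / total, severe / n_labeled)
    return summary


summary = summarize(counts, TOTAL_SAMPLES)
for name, (n_labeled, frac_labeled, prevalence) in summary.items():
    print(f"{name}: n={n_labeled}, labeled={frac_labeled:.1%}, severe={prevalence:.1%}")
```

Under this reading, the attributable definition has both the fewest labeled cases and the lowest proportion of severe cases, which is worth keeping in mind when comparing model performance across definitions.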
Methods
- Sequences were processed with mothur according to the MiSeq SOP and clustered
into de novo OTUs at a 3% distance threshold6,7.
- We then trained machine learning (ML) models with OTU abundances as features to
predict the IDSA severity, CDI-attributable severity, and all-cause severity of CDI cases using the mikropml R package and its accompanying Snakemake workflow8,9.
Machine learning pipeline

- Prior to model training, the data were pre-processed to scale and center at zero, remove features with near-zero variance, and collapse perfectly correlated features.
- The dataset was randomly split 100 times into training and testing sets with 80% of the
data in the training set.
- On each partition, random forest models were trained with 5-fold cross-validation
repeated 100 times, and performance as the area under the receiver operating
characteristic curve (AUROC) was measured on the held-out testing set for the best model.
- The top 5 most important features contributing to model performance were determined for each model using a permutation test.
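The partitioning and scoring steps above can be sketched in plain Python. This is an illustration only: the analysis itself was run with mikropml in R, and the sketch omits pre-processing, random forest training, and cross-validation. The helper names `split_indices` and `auroc` are hypothetical, and the split here is a simple shuffle, which may differ from how mikropml partitions the data.

```python
import random


def split_indices(n_samples, train_frac=0.8, seed=0):
    """Shuffle sample indices and split them into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(train_frac * n_samples))
    return indices[:n_train], indices[n_train:]


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 100 random 80/20 partitions, one per seed, so each split is reproducible.
partitions = [split_indices(1191, 0.8, seed) for seed in range(100)]
```

Seeding each partition separately keeps every split reproducible on its own, so a single poorly performing partition can be re-examined without rerunning the others.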
Results
Feature importance
The top 5 most important OTUs for predicting each outcome were determined with a permutation test.
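A minimal sketch of the idea behind permutation importance follows: shuffle one feature across samples, re-score the model, and record how much the AUROC drops. It is not the mikropml implementation. The toy data, the stand-in `toy_model`, and the function names are all hypothetical.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, feature, n_perm=100, seed=0):
    """Mean decrease in AUROC when one feature column is shuffled across samples."""
    rng = random.Random(seed)
    baseline = auroc(y, [predict(row) for row in X])
    original = [row[feature] for row in X]
    drops = []
    for _ in range(n_perm):
        shuffled = original[:]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in place of the original one.
        X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, shuffled)]
        drops.append(baseline - auroc(y, [predict(row) for row in X_perm]))
    return sum(drops) / n_perm


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.6, 0.5],
     [0.4, 0.3], [0.3, 0.9], [0.2, 0.4], [0.1, 0.6]]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Stand-in for a trained model: scores a sample using only feature 0."""
    return row[0]


imp_used = permutation_importance(toy_model, X, y, feature=0)
imp_unused = permutation_importance(toy_model, X, y, feature=1)
```

Shuffling a feature the model never uses leaves its predictions unchanged, so `imp_unused` is exactly zero, while shuffling the informative feature lowers the AUROC and gives a positive importance.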

Conclusions
- The long tails of the performance distributions for CDI-attributable and all-cause severity may reflect the rarity of severe outcomes according to these definitions.
- That models predicting CDI-attributable severity performed best implies that chart review by physicians is an important step to filter out other causes of complications.
- The poor-to-modest performance of these OTU-based models implies that the taxonomic composition of the microbiome is not the only important factor contributing to severe CDI outcomes.
Future directions
- Using the area under the precision-recall curve (AUPRC) may provide a better estimate of model performance than AUROC as the data are imbalanced.
- Training models with both EHR data and OTUs as features may improve model performance.
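To illustrate the AUPRC point above, the sketch below computes average precision, a common step-wise estimate of AUPRC, together with the no-skill baseline, which is simply the fraction of positive cases. Unlike the AUROC baseline of 0.5, this baseline shifts with class balance. The function names are illustrative, and ties in scores are broken by input order, which is a simplification.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive case."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one positive")
    return precision_sum / hits


def no_skill_auprc(n_positive, n_negative):
    """Expected AUPRC of a classifier that ranks cases at random."""
    return n_positive / (n_positive + n_negative)


# Toy example (hypothetical scores): positives sit at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.1])

# Baseline for the attributable definition, using the counts reported above.
baseline_attributable = no_skill_auprc(26, 513)
```

With so few severe cases under the attributable definition, the no-skill AUPRC is below 0.05, so AUPRC values are best read relative to this baseline rather than to 0.5.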
Acknowledgements
This research was supported by the National Institutes of Health grant U01AI124255
and the Michigan Institute for Clinical and Health Research Postdoctoral
Translational Scholars Program (UL1TR002240 from the National Center for
Advancing Translational Sciences).